Reverse Iterative Deepening for Finite-Horizon MDPs with Large Branching Factors
Authors
Abstract
In contrast to previous competitions, where the problems were goal-based, the 2011 International Probabilistic Planning Competition (IPPC-2011) emphasized finite-horizon reward-maximization problems with large branching factors. These MDPs modeled more realistic planning scenarios and presented challenges to the previous state-of-the-art planners (e.g., those from IPPC-2008), which were primarily based on domain determinization, a technique more suited to goal-oriented MDPs with small branching factors. Moreover, large branching factors render the existing implementations of RTDP- and LAO∗-style algorithms inefficient as well. In this paper we present GLUTTON, our planner at IPPC-2011 that performed well on these challenging MDPs. The main algorithm used by GLUTTON is LR²TDP, an LRTDP-based optimal algorithm for finite-horizon problems centered around the novel idea of reverse iterative deepening. We detail LR²TDP itself as well as a series of optimizations included in GLUTTON that help LR²TDP achieve competitive performance on difficult problems with large branching factors: subsampling the transition function, separating out natural dynamics, caching transition-function samples, and others. Experiments show that GLUTTON and PROST, the IPPC-2011 winner, have complementary strengths, with GLUTTON demonstrating superior performance on problems with few high-reward terminal states.
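To make the reverse-iterative-deepening idea concrete, the sketch below solves a toy finite-horizon MDP by running RTDP-style trials for horizon 1, then horizon 2, and so on, reusing the value estimates of the shorter horizons when solving the longer ones. It is a minimal illustration under stated assumptions: the toy MDP, the Monte-Carlo Q estimate, and the trial counts are invented for the example, and LRTDP's solved-state labeling and GLUTTON's other optimizations are omitted for brevity.

```python
# Hypothetical minimal sketch of reverse iterative deepening for a
# finite-horizon MDP, using plain RTDP-style trials (LRTDP's
# solved-state labeling is omitted).  The toy MDP, horizon, and
# trial counts are illustrative assumptions, not from the paper.
import random
from collections import defaultdict

STATES = range(4)   # toy state space (assumed)
ACTIONS = range(2)  # toy action space (assumed)

def sample_next(s, a):
    """Sample a successor state (stand-in for a subsampled transition function)."""
    return random.choice([(s + a) % 4, (s + a + 1) % 4])

def reward(s, a):
    """Toy reward: being in state 3 pays off."""
    return 1.0 if s == 3 else 0.0

def q_value(V, s, a, t, n_samples=20):
    """Monte-Carlo estimate of Q(s, a) with t steps to go."""
    if t == 0:
        return 0.0
    succ = [sample_next(s, a) for _ in range(n_samples)]
    return reward(s, a) + sum(V[(s2, t - 1)] for s2 in succ) / n_samples

def rtdp_trial(V, s0, t):
    """One RTDP trial from s0 with t steps to go, backing up greedily."""
    s = s0
    for steps_left in range(t, 0, -1):
        qs = {a: q_value(V, s, a, steps_left) for a in ACTIONS}
        best_a = max(qs, key=qs.get)
        V[(s, steps_left)] = qs[best_a]   # Bellman backup at (state, steps-to-go)
        s = sample_next(s, best_a)        # follow the greedy action

def reverse_iterative_deepening(s0, horizon, trials_per_horizon=50):
    """Solve the horizon-1 problem, then horizon-2, ..., reusing earlier values."""
    V = defaultdict(float)
    for h in range(1, horizon + 1):
        for _ in range(trials_per_horizon):
            rtdp_trial(V, s0, h)
    return V

if __name__ == "__main__":
    V = reverse_iterative_deepening(s0=0, horizon=5)
    print("V(s0, 5 steps to go) ~", round(V[(0, 5)], 3))
```

The key design point is that the value table is indexed by (state, steps-to-go), so the values computed while solving the horizon-h problem directly seed the horizon-(h+1) problem; this is what lets the deepening proceed cheaply from short horizons to the full one.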
Similar resources
LRTDP Versus UCT for Online Probabilistic Planning
UCT, the premier method for solving games such as Go, is also becoming the dominant algorithm for probabilistic planning. Out of the five solvers at the International Probabilistic Planning Competition (IPPC) 2011, four were based on the UCT algorithm. However, while a UCT-based planner, PROST, won the contest, an LRTDP-based system, GLUTTON, came in a close second, outperforming other systems ...
Effect of Reward Function Choices in MDPs with Value-at-Risk
This paper studies Value-at-Risk problems in finite-horizon Markov decision processes (MDPs) with finite state space and two forms of reward function. Firstly we study the effect of reward function on two criteria in a short-horizon MDP. Secondly, for long-horizon MDPs, we estimate the total reward distribution in a finite-horizon Markov chain (MC) with the help of spectral theory and the centr...
Nonlinear Policy Gradient Algorithms for Noise-Action MDPs
We develop a general theory of efficient policy gradient algorithms for Noise-Action MDPs (NMDPs), a class of MDPs that generalize Linearly Solvable MDPs (LMDPs). For finite horizon problems, these lead to simple update equations based on multiple rollouts of the system. We show that our policy gradient algorithms are faster than the PI algorithm, a state of the art policy optimization algorith...
Approximation Approaches for Solving Security Games with Surveillance Cost: A Preliminary Study
Security game models have been deployed to allocate limited security resources for critical infrastructure protection. Much work on this topic assumes that attackers have perfect knowledge of the defender’s randomized strategy. However, this assumption is not realistic, considering surveillance cost, since attackers may only have partial knowledge of the defender’s strategies, and may dynamical...
Solving Markov Decision Processes via Simulation
This chapter presents an overview of simulation-based techniques useful for solving Markov decision problems/processes (MDPs). MDPs are problems of sequential decision-making in which decisions made in each state collectively affect the trajectory of the states visited by the system over a time horizon of interest to the analyst. The trajectory, in turn, usually affects the performance of the...